Abstract

Several methods for estimating item response theory scores for multiple subtests were compared.
These methods included two multidimensional item response theory models: a bi-factor model
where each subtest was a composite score based on the primary trait measured by the set of tests
and a secondary trait measured by the individual subtest, and a model where the traits measured
by the subtests were separate but correlated. Composite scores based on unidimensional item
response theory, with each subtest borrowing information from the other subtest, as well as
independent unidimensional scores for each subtest were also considered. Correlations among
scores from all methods were high, though somewhat lower for the independent unidimensional
scores. Correlations between course grades and test scores, a measure of validity, were similar
for all methods, though again slightly lower for the unidimensional scores. To assess bias and
RMSE, data were simulated using the parameters estimated for the correlated factors model. The
independent unidimensional scores showed the greatest bias and RMSE; the relative
performance of the other three methods varied with the subscale.

Scoring Subscales using Multidimensional Item Response Theory Models

Tests are often designed such that each item measures the primary trait and one additional
secondary trait. The secondary traits may reflect different content categories in the test blueprint,
or different tests within a battery of tests. In this situation, test users may want subscale scores,
each of which reflects both the primary trait and the relevant secondary trait. Two
multidimensional item response theory (MIRT) models are potentially useful in this context: a
model with n correlated traits, where n is the number of subscales, or a bi-factor model with one
primary trait and n orthogonal secondary traits. An additional model, which applies
unidimensional IRT in the initial scoring of each subscale but then borrows information from
correlated subscales in forming the final subscale scores, could also be applied.

In the bi-factor model, all items are specified to load on the primary factor. Additionally,
each item may load on one additional factor. The factors are orthogonal (Gibbons & Hedeker,
1992; McLeod, Swygert, & Thissen, 2001). In other words, a secondary factor is the common
factor a group of items shares beyond their shared association with the primary factor.
Hierarchical is a more general term for this class of models; bi-factor emphasizes that each item
loads on no more than two traits, including the primary trait. With the bi-factor model, scores can
be estimated for the primary trait and each secondary trait. On a battery of tests, though, it would
seem desirable for each subtest score to be a measure of the overall construct covered in the
subtest, not just the part of the construct not covered by the primary factor. In other words, the
score should be a combination of the primary trait and the secondary trait, not just the secondary
trait. To quantify the relative weights of the factors contributing to an item response, Reckase
(1985; 1997; Reckase & McKinley, 1991) defined the direction of greatest slope for item i as

$$\cos \alpha_{ik} = \frac{a_{ik}}{\sqrt{\sum_{m} a_{im}^{2}}}, \qquad (1)$$

where $\alpha_{ik}$ is the angle with axis k and $a_{ik}$ is the discrimination parameter for trait k. In the
bi-factor model, it is simplest to view the angle for each item relative to the primary factor. If an
item measured only the primary trait, $\alpha$ would be 0°; if an item measured the primary trait and
secondary trait equally, $\alpha$ would be 45°. Though each item may measure a slightly different
composite of the primary and secondary traits, an average composite could be formed for each
subtest, based on the average angle with the primary axis for the items in the subtest.
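
As an illustration of the angle computation described above, the following is a minimal Python sketch. It assumes the usual direction-cosine definition (each discrimination divided by the length of the item's discrimination vector); the function names and the example discrimination values are hypothetical and are not taken from the study.

```python
import math

def item_angles(a):
    """Angles in degrees between an item's direction of greatest slope and each axis.

    a: list of discrimination parameters, one per trait (hypothetical values).
    """
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0.0:
        raise ValueError("at least one discrimination must be nonzero")
    # Clamp the cosine to [-1, 1] to guard against floating-point overshoot.
    return [math.degrees(math.acos(max(-1.0, min(1.0, x / norm)))) for x in a]

def mean_primary_angle(items):
    """Average angle with the primary (first) axis over the items of one subtest."""
    return sum(item_angles(a)[0] for a in items) / len(items)

# Hypothetical bi-factor items: (primary discrimination, secondary discrimination).
subtest = [[1.0, 0.0], [0.8, 0.8], [1.2, 0.5]]
print(mean_primary_angle(subtest))
```

An item loading only on the primary trait yields an angle of 0 degrees with the primary axis, and an item with equal discriminations yields 45 degrees, matching the two cases in the text.
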
Multidimensional IRT Approach 

TESTFACT (Bock, Gibbons, Schilling, Muraki, Wilson, & Wood, 2003) and NOHARM
(Fraser, 1988) both estimate the item parameters for multidimensional normal ogive models for
dichotomous items. The model estimated is:

$$P_i(\boldsymbol{\theta}) = c_i + (1 - c_i)\,\Phi(\mathbf{a}_i'\boldsymbol{\theta} + d_i), \qquad (2)$$

where $P_i(\boldsymbol{\theta})$ is the probability of correct response on item i given the $\boldsymbol{\theta}$ vector of abilities and the
item parameters, $\Phi$ indicates the cumulative standard normal distribution, $c_i$ is the lower
asymptote, $\mathbf{a}_i$ is a vector of discrimination parameters, and $d_i$ is the item difficulty. In contrast to
the common unidimensional models, $d_i$ is added, not subtracted, so easier items have higher
values for d. TESTFACT uses full information maximum likelihood estimation for estimating
the parameters (Bock, Gibbons, & Muraki, 1988; Gibbons & Hedeker, 1992; Muraki &
Engelhard, 1985), while NOHARM uses bivariate information (proportion correct for each item
and joint proportion correct for each pair of items) and estimates a polynomial approximation to
the normal ogive model (McDonald, 1997; 1999).

Though both TESTFACT and NOHARM use the normal ogive models, the parameters
estimated should be virtually the same as those of the multidimensional logistic IRT model,
which may be more familiar to some readers:

$$P_i(\boldsymbol{\theta}) = c_i + (1 - c_i)\,\frac{e^{1.7(\mathbf{a}_i'\boldsymbol{\theta} + d_i)}}{1 + e^{1.7(\mathbf{a}_i'\boldsymbol{\theta} + d_i)}}, \qquad (3)$$

where the parameters are as defined for Equation 2.
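
To make the near-equivalence of the normal ogive and logistic forms concrete, here is a small Python sketch that evaluates both response functions over a grid of abilities and reports the largest gap. The item parameters are hypothetical, chosen only for illustration, and the function names are not from TESTFACT or NOHARM.

```python
import math

def normal_cdf(z):
    """Cumulative standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_normal_ogive(theta, a, d, c=0.0):
    """Normal ogive probability of a correct response (d is added, not subtracted)."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return c + (1.0 - c) * normal_cdf(z)

def p_logistic(theta, a, d, c=0.0, scale=1.7):
    """Logistic probability with the 1.7 scaling constant."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return c + (1.0 - c) / (1.0 + math.exp(-scale * z))

# Hypothetical two-dimensional item.
a, d, c = [1.0, 0.5], -0.3, 0.2
grid = [x / 2.0 for x in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
max_gap = max(
    abs(p_normal_ogive((t1, t2), a, d, c) - p_logistic((t1, t2), a, d, c))
    for t1 in grid
    for t2 in grid
)
print(max_gap)
```

With the 1.7 constant the two curves differ by less than .01 in probability everywhere, which is why parameter estimates from the two parameterizations are treated as interchangeable here.
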

TESTFACT and NOHARM parameter estimates have been compared for exploratory
models. Zhang and Stone (2004) found that TESTFACT and NOHARM produced very similar
parameter estimates when used in an exploratory mode with items measuring two uncorrelated
factors. Knol and Berger (1991) recovered the parameters of one-, two-, and three-dimensional
exploratory IRT models using TESTFACT, NOHARM, MAXLOG, and traditional linear factor
analysis methods. Of the non-linear (IRT) packages, MAXLOG had the largest RMSE between
the parameter estimates and true parameters. TESTFACT performed slightly better than
NOHARM. Miller (1991) reviewed earlier comparisons of TESTFACT, MIRTE, and
MULTIDIM by Ackerman, and using the same data set added a study of NOHARM. The mean
and standard deviation of the residuals were similar for NOHARM and TESTFACT, a bit higher
for MULTIDIM, and quite large for MIRTE. In the recovery of individual item
parameters within NOHARM, the standard errors and bias were large for the anchor items, and
choosing items with average levels of difficulty and discrimination or items expected to load
highly on a given dimension as anchors did not help.

Gosz and Walker (2002) compared the probability of correct response based on the true item
parameters and the item parameters estimated in TESTFACT and in NOHARM. True theta
values were used with the recovered item parameters in the probability calculations. Items were
generated from a two-dimensional extension of the 2PL model; in some conditions one subset of
items loaded only on factor one and the remaining items loaded only on factor two, while in other
conditions a subset of items loaded on both factors. Exploratory analyses, with two factors, were
used to recover the item parameters. The probabilities of correct response based on the
NOHARM item parameter estimates were closer to the true probabilities than were those based
on the TESTFACT estimates. NOHARM's advantage over TESTFACT was particularly
pronounced for items with a higher discrimination value on the second factor.

De Champlain and Gessaroli (1998) tested for unidimensionality using TESTFACT and
NOHARM with small samples and short tests. In NOHARM, unidimensionality was assessed by
fitting a one-factor model and using a fit index based on the standardized residuals of the
inter-item joint-probability matrix. In TESTFACT, both unidimensional and exploratory
two-dimensional models were estimated, and the difference in log-likelihood fit was calculated.
When the data were unidimensional, TESTFACT had high Type I error rates, while the error rate
for NOHARM was close to the nominal value. When the data were two-dimensional, both
methods had high power when the traits were uncorrelated or when the test had 40 items, but
NOHARM had higher power when the traits were correlated and the test had only 20 items.

None of these studies compared NOHARM and TESTFACT when used for confirmatory
analysis of multidimensional models. Also, these studies did not compare different confirmatory
models such as the bi-factor model with a model with correlated factors.

Alternative to Multidimensional Models 

When several tests are administered to examinees, or when a test includes subscales, the
parameters for each subtest or subscale may be estimated separately. In addition to avoiding
dependency problems due to items on a subscale sharing an additional factor beyond the primary
test trait, this approach allows for estimation of several trait scores. The potential drawback of
this approach is that the subtests may be quite short and unreliable. Wainer et al. (2001)
developed a score augmentation approach that uses information from the other subscales in
estimating a subscale score. This method can be used with both classical test theory and
IRT-based scores. The extent of the influence of the other subscores depends on the intercorrelations
of the subscores and their reliabilities. For example, a subtest that had low reliability and was
highly correlated with another subscale with high reliability would be affected more. Augmented
subscale scores are estimated by:

$$\hat{\mathbf{x}}_j = \bar{\mathbf{x}}_{\cdot} + \mathbf{S}_{\text{true}}\,\mathbf{S}_{\text{obs}}^{-1}\,(\mathbf{x}_j - \bar{\mathbf{x}}_{\cdot}), \qquad (4)$$

where $\hat{\mathbf{x}}_j$ is a vector of augmented subscale scores for examinee j, $\mathbf{x}_j$ is a vector of unaugmented
subscale scores for person j, $\bar{\mathbf{x}}_{\cdot}$ is a vector of subscale means, and $\mathbf{S}_{\text{true}}$ and $\mathbf{S}_{\text{obs}}$ are the
estimated true and observed variance-covariance matrices. When applied to Bayesian IRT scores
(expected a posteriori (EAP) or modal a posteriori (MAP)), the bias of the scores toward the
subscale means is first adjusted for, based on the subscale marginal reliability. Each diagonal
element of $\mathbf{S}_{\text{true}}$ and $\mathbf{S}_{\text{obs}}$ is divided by the squared marginal reliability of the corresponding
subscale, and each off-diagonal element is divided by the product of the reliabilities of the
corresponding subscales. Each subscale score in $\mathbf{x}_j$ is also divided by the marginal reliability. The
vector $\bar{\mathbf{x}}_{\cdot}$ will be a vector of 0's if the IRT metric has been scaled so that the estimated population
mean of each subscale is zero.

Gessaroli (2004) applied Wainer et al.'s (2001) method to a test with four correlated
subscores, using the number-correct scores. He also estimated scores with the bi-factor model;
using the thetas and item parameter estimates from the bi-factor model he calculated the
expected score on the number-correct metric. Results from the two methods were essentially
identical, and these methods yielded scores with greater reliability/smaller standard errors than
raw scores.

Purpose of the study 

In this study, estimated scores from the bi-factor model, a multidimensional model with
correlated factors, and the augmented score approach were examined. The bi-factor model was
estimated in TESTFACT and NOHARM, and the subscale scores were estimated as a weighted
combination of the primary factor and the associated subscale factor. An additional model was
estimated in NOHARM: the items on each subscale formed a factor and the factors were free to
correlate. Separate unidimensional models for each subscale were also estimated, and Wainer et
al.'s (2001) data augmentation method was used to estimate scores for each subscale. In addition,
the unaugmented unidimensional scores themselves were examined. Note that all of these
methods used IRT scores (thetas); unlike Gessaroli's (2004) study, the IRT scores were not used
to estimate scores in the number-correct metric. These results would be most applicable when
standardized test scores are a direct transformation of the IRT thetas.

Method 

Participants 

Second- and third-year students at James Madison University participated in this study. The
university requires students with 45-70 credit hours to participate in assessment activities each
spring. The students are randomly assigned to assessment instruments covering different areas of
general education or student development. Over the course of three years, 2552 students
completed the two tests selected for this study.



Instruments and Procedures 
Two multiple-choice assessment tests were used for this study, a scale covering American
history and political science and a scale covering global issues. Though the tests administered to
the students were longer, 20 of the American items and 15 of the global items were used in this
study to create a situation where the scales were short enough that reliability would be increased
by using a multidimensional model or augmented scores. The tests were presented at the same
testing session along with varying other instruments. Tests were not strictly timed, but when most
students had finished a test the others were given an additional five minutes to finish. Test scores
did not appear on student transcripts and were used only in the aggregate for program
assessment.

Estimation 

Item parameters for the bi-factor model were estimated twice, once using TESTFACT 4.0¹
and once using NOHARM. NOHARM does not provide ability estimates, so EAP (expected a
posteriori) ability estimates were obtained using TESTFACT 4.0, once using the item
parameters from TESTFACT and again using the item parameters from NOHARM. Ability was
estimated for both the primary and secondary factors (American history was one secondary
factor and global issues was the other). Nine evenly spaced quadrature points from -4 to 4 were
used for each dimension. For each of the two subtests, the average angle with the primary trait
axis was calculated. Based on this angle, the subscale ability was calculated as a weighted linear
composite. The ratio of the weight for $\theta_2$ to the weight for $\theta_1$ used in forming the composite
score is equal to the tangent of this angle (Ackerman, 1991). While any weights with this ratio
could be used, the weights used here were chosen such that the sum of their squares was equal to
one; because the $\theta$s were uncorrelated and they were scaled such that their estimated variances
were 1, this would result in a composite score with an estimated variance of 1 (the observed
variance of the Bayesian ability estimates was of course less than 1 due to shrinkage).
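
The weighting rule in this paragraph can be sketched in a few lines of Python. Weights whose ratio equals the tangent of the average angle and whose squares sum to one are the cosine and sine of that angle. The function names are hypothetical; the 21-degree input is the NOHARM average angle reported for the American subtest in the Results.

```python
import math

def composite_weights(mean_angle_deg):
    """Weights (primary, secondary) with ratio tan(angle) and unit sum of squares."""
    angle = math.radians(mean_angle_deg)
    return math.cos(angle), math.sin(angle)

def composite_score(theta_primary, theta_secondary, mean_angle_deg):
    """Weighted linear composite of the primary and secondary ability estimates."""
    w1, w2 = composite_weights(mean_angle_deg)
    return w1 * theta_primary + w2 * theta_secondary

w1, w2 = composite_weights(21.0)
print(round(w1, 2), round(w2, 2))  # 0.93 0.36
```

Because the two factors are orthogonal with unit variance, the variance of the composite is the sum of the squared weights, which this choice fixes at one.
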

A model with two correlated factors (based on the two subscales) was also estimated in
NOHARM. Because TESTFACT only calculates ability estimates for uncorrelated factors, and
NOHARM does not provide ability estimates, a routine for EAP estimation was written using
SAS. The quadrature points for each dimension were the same as those used in TESTFACT for
orthogonal factors, but the prior densities at each point were based on a bivariate normal
distribution with a correlation equal to the correlation between the factors estimated in
NOHARM.
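
The study's EAP routine was written in SAS and is not reproduced in the paper. The following Python sketch shows one way such a routine could work under stated assumptions: a 9 x 9 grid of evenly spaced points from -4 to 4, a standard bivariate normal prior with correlation rho, and the normal ogive response function. All names and the example items are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bivariate_normal_density(t1, t2, rho):
    """Standard bivariate normal density with correlation rho."""
    q = (t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def eap_two_factor(responses, items, rho, n_points=9, lo=-4.0, hi=4.0):
    """EAP estimates (theta1, theta2) from a two-dimensional quadrature grid.

    responses: list of 0/1 item scores.
    items: list of (a1, a2, d, c) tuples, one per item (assumed layout).
    """
    step = (hi - lo) / (n_points - 1)
    points = [lo + k * step for k in range(n_points)]
    total = sum1 = sum2 = 0.0
    for t1 in points:
        for t2 in points:
            # Prior density times likelihood at this grid point.
            weight = bivariate_normal_density(t1, t2, rho)
            for u, (a1, a2, d, c) in zip(responses, items):
                p = c + (1.0 - c) * normal_cdf(a1 * t1 + a2 * t2 + d)
                weight *= p if u == 1 else 1.0 - p
            total += weight
            sum1 += t1 * weight
            sum2 += t2 * weight
    # The mean of each marginal posterior equals the joint-weighted mean.
    return sum1 / total, sum2 / total

# Three hypothetical items that load only on the first factor, all answered correctly.
items = [(1.0, 0.0, 0.0, 0.0)] * 3
print(eap_two_factor([1, 1, 1], items, rho=0.81))
```

In this example the second ability estimate is pulled above zero even though no item measures it directly, which is the borrowing of information that the correlated prior provides. The grid size grows as the number of points raised to the number of factors.
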

Wainer et al.'s (2001) data augmentation method uses scores estimated for each subscale
separately. BILOG-MG 3.0 (Zimowski, Muraki, Mislevy, & Bock, 2003) was used to estimate a
three-parameter logistic (3-PL) unidimensional model for each subscale. EAP scores were
estimated for each subscale within BILOG-MG as well. Augmented subscale scores were then
calculated following Equation 4 with the adjustments noted for EAP scores.

Results 

Bi-factor model. For the bi-factor model, using NOHARM item parameter estimates, the
average angle with the primary factor axis was 21 degrees for the American subtest and 30
degrees for the global subtest. The angles based on the TESTFACT item parameter estimates
were similar: 27 and 25 degrees. Thus, both subscales were measuring the primary dimension
more than the secondary dimensions. The resulting weighted composites, based on these average
angles, were $\theta_{\text{American}} = 0.93\,\theta_1 + 0.36\,\theta_2$ and $\theta_{\text{Global}} = 0.86\,\theta_1 + 0.50\,\theta_3$ for NOHARM, and
$\theta_{\text{American}} = 0.90\,\theta_1 + 0.43\,\theta_2$ and $\theta_{\text{Global}} = 0.91\,\theta_1 + 0.42\,\theta_3$ for TESTFACT.

2-Factor model. When the items were calibrated with a 2-factor model in NOHARM, the
estimated correlation between the factors was .81. Thus, the prior distribution used for the EAP
estimation of the $\theta$s was a bivariate normal distribution with a correlation of .81. For each
examinee, the posterior distribution of $\theta_1$ and $\theta_2$ was approximated using quadrature methods as
described in the Method section; the joint distribution was marginalized over each dimension in
turn, and the mean of this distribution was taken as the EAP score.

Augmented Subscale Scores. The estimated marginal reliability was .76 for the
unidimensional American scores and .68 for the global scores. The variances of the score
estimates were .77 and .68 (EAP scores are biased inward, so the variances of the estimated
scores are lower than the estimated variances of the true scores), with a covariance of .42.
Substituting these numbers into Equation 4 and adjusting for the bias in the EAP estimates, the
functions for the augmented scores were:
$\theta_{\text{American(augmented)}} = (.64/.76)\,\theta_{\text{American}} + (.20/.68)\,\theta_{\text{Global}}$ and
$\theta_{\text{Global(augmented)}} = (.29/.76)\,\theta_{\text{American}} + (.52/.68)\,\theta_{\text{Global}}$.
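
The coefficients above can be checked with a short Python sketch of Equation 4 for two subscales. It applies the EAP adjustment described earlier and assumes, as is usual for this kind of augmentation, that the true-score matrix shares the observed off-diagonal elements and has diagonal elements equal to reliability times the observed variance. That assumption and the function name are not stated in the paper; they are used here because they reproduce the reported values.

```python
def augmentation_matrix(reliability, eap_var, eap_cov):
    """Coefficient matrix S_true * inverse(S_obs) for two subscales (EAP-adjusted)."""
    r1, r2 = reliability
    v1, v2 = eap_var
    # Adjusted observed matrix: diagonal / reliability^2, off-diagonal / product.
    o11, o22, o12 = v1 / r1 ** 2, v2 / r2 ** 2, eap_cov / (r1 * r2)
    # Assumed true-score matrix: diagonal shrunk by reliability, same off-diagonal.
    t11, t22, t12 = r1 * o11, r2 * o22, o12
    # Inverse of the 2 x 2 adjusted observed matrix.
    det = o11 * o22 - o12 * o12
    i11, i22, i12 = o22 / det, o11 / det, -o12 / det
    return [
        [t11 * i11 + t12 * i12, t11 * i12 + t12 * i22],
        [t12 * i11 + t22 * i12, t12 * i12 + t22 * i22],
    ]

# Values reported in the text: reliabilities, EAP variances, and EAP covariance.
B = augmentation_matrix((0.76, 0.68), (0.77, 0.68), 0.42)
print([[round(x, 2) for x in row] for row in B])
```

Each coefficient in column j is then divided by the reliability of subscale j when it is applied to the raw EAP scores, which gives the .64/.76, .20/.68, .29/.76, and .52/.68 terms.
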

Comparison of Score Estimates. Tables 1 and 2 show the mean, standard deviation,
minimum, and maximum of the score estimates. The means for all methods are essentially zero.
The independent unidimensional models method led to scores with a smaller range, and
somewhat smaller standard deviation in the case of the global subtest; the more extreme scores
were pulled towards the mean more with the unidimensional models, presumably because of
lower reliability.

Correlations among the scores are shown in Tables 3 and 4. With the exception of the
separate unidimensional models, correlations among the scores based on different methods were
all at least 0.99. The unidimensional scores had correlations of at least 0.94 with scores from the
other methods.

Comparison of Correlations with Course Grades. The objectives on the test blueprints
matched the objectives for general education courses. For the American history and political
science curriculum, students selected one of two courses: US history or US political science. For
global issues, students chose from five courses designed to address the global issues curricular
objectives; these courses were offered in the departments of anthropology, economics,
geography, political science, and sociology. Correlations between the test scores and the relevant
course grades are shown in Tables 5 and 6. Correlations varied depending on the course; within
each course, the unidimensional scores consistently had a slightly smaller correlation with the
course grades.

Comparison of Bias and RMSE. Because real data were used in this example, it is difficult to
know which model produces the most accurate estimates. Simulations were run to assess bias
and RMSE of the ability estimates. For these simulations, the item parameters estimated from the
real data using the two-factor model were used as the generating, or true, item parameters. A
sample of 2500 simulees was drawn from a bivariate normal distribution with a correlation of .81
between the abilities. Item responses were simulated for each of 100 replications, using the
logistic parameterization in Equation 3 for convenience. Item parameters were recovered for
each replication, and the ability parameters were estimated based on the item parameter
estimates, not the generating item parameters. Thus, errors in item parameter estimation and
errors in estimating the coefficients for the linear combinations used in the bi-factor and
augmented score methods were taken into account. The simulation was conducted for the
bi-factor model, the 2-factor model, the augmented approach, and the independent unidimensional
models; for the bi-factor model, because the scores were virtually identical using TESTFACT
and NOHARM, only TESTFACT was used in the simulation.
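
The data-generation and summary steps of such a simulation can be sketched in Python as follows. This is an illustration only: it covers drawing correlated abilities, simulating dichotomous responses under the logistic form, and computing bias and RMSE, and it leaves out the item and ability estimation that the study performed in TESTFACT, NOHARM, BILOG-MG, and SAS. The function names and the seed are hypothetical.

```python
import math
import random

def draw_thetas(n, rho, rng):
    """Draw n ability pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs

def simulate_response(theta, a, d, c, rng, scale=1.7):
    """Simulate one 0/1 response under the logistic model with lower asymptote c."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    p = c + (1.0 - c) / (1.0 + math.exp(-scale * z))
    return 1 if rng.random() < p else 0

def bias_and_rmse(true_values, estimates):
    """Mean error and root mean squared error of estimates against true values."""
    errors = [e - t for t, e in zip(true_values, estimates)]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return bias, rmse

rng = random.Random(2024)            # arbitrary seed, for repeatability
thetas = draw_thetas(2500, 0.81, rng)
```

To plot bias and RMSE across the ability range, `bias_and_rmse` would be applied separately to the simulees falling in each ability interval, pooling over replications.
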

Figures 1 and 2 show the bias across the ability range. As would be expected with EAP
scores, scores were biased toward the mean. For the American subtest, the augmented scores
were the least biased and the unidimensional scores were the most biased. Bias was slightly
greater for the bi-factor model than for the 2-factor model. For the Global subtest, both the
augmented scores and the unidimensional scores were more biased than the bi-factor and
2-factor models, and bias was again slightly greater for the bi-factor model than for the 2-factor
model.

Figures 3 and 4 show the RMSE across the ability range. For both subtests, RMSE tended to
be greatest for the unidimensional method. For the American subtest, in the center of the ability
distribution both the augmented and unidimensional scores had somewhat greater RMSE than
the multidimensional IRT approaches. At abilities below -1.5 and above 1.5, RMSE was lowest
for the augmented scores and greatest for the unidimensional scores, with RMSE for the bi-factor
and 2-factor models in the middle. For the Global subtest, in the center of the distribution
RMSE was slightly lower for the augmented scores, but at more extreme scores below -1.5 or
above 1.5, RMSE was lower for the bi-factor and 2-factor models. This is opposite the pattern
seen for the American subtest. These differing patterns were consistent with the results for bias;
the bias for the American scores was lowest for the augmented scores, so the augmented scores
might be expected to have lower RMSE in the extremes provided they did not have much
greater standard deviations. Similarly, the bias for Global scores was lower for the
multidimensional IRT approaches, so at the extremes these scores might be expected to have
lower RMSE, again assuming they did not have appreciably larger standard deviations.

Discussion, Limitations, and Conclusions

Scores on the American history and political science test and the global issues test used in
this study were nearly the same based on the TESTFACT and NOHARM bi-factor models, a
2-correlated-factor model, and the augmented score approach. Correlations among these methods
were .99 or higher and the mean, standard deviation, and range of scores were similar. Scores
based on two separate unidimensional models had slightly lower, though still high, correlations
with the other models, but a smaller standard deviation and range, presumably because they were
biased toward the mean more than the scores from the other approaches. Correlations between
the scores and course grades were slightly lower for the independent unidimensional model. For
any one course the difference was so small it would not be worth mentioning, except that this
correlation was consistently the smallest correlation for each course.

The simulation study showed the unidimensional scores were more biased and had higher
RMSE. The relative bias and RMSE for the other approaches differed on the two tests. The
bi-factor and 2-factor models showed very similar levels of bias and RMSE; on one test higher than
the augmented scores at the extremes, and on the other test lower. Based on these results, there is
no clear advantage for any of these three methods over the others, but all produced lower bias
and RMSE than the separate unidimensional models.

A limitation of using a real data set, or item parameters based on a real data set for the
simulation study, is that it is difficult to know how the results will generalize to other tests. On
the other hand, with a real data set the results are at least realistic for one situation. These results
might generalize to other situations where the subtests are of moderately short length (15-20
items) and measure relatively similar skills and are administered at the same time. They might be
less generalizable to longer or shorter subtests or to those which measure more disparate skills or
are administered further apart in time.

The implications of this study are that with tests of 15-20 items, less biased scores with
smaller standard errors can be obtained using a multidimensional IRT model or Wainer et al.'s
(2001) augmented score approach. The augmented scores might be the least arduous to calculate;
for this study BILOG-MG was used for estimating the unidimensional scores and the IML
procedure within SAS was used to estimate the coefficients for combining these scores. The
correlated-factor model might be most conceptually appealing, but I am aware of no commercial
software which will estimate IRT scores for correlated factors², so calculation of scores must be
done separately and requires a geometrically increasing number of quadrature points as the
number of factors increases.

Footnotes

¹ Because TESTFACT only provides ability estimates for the primary factor when the
bi-factor model is used, the item parameter file was re-configured to match the format of the item
parameter file from a three-factor model, with a-parameters set to 0 for the secondary factor not
measured by an item, and the ability estimates were obtained in a second run using this
parameter file. This was possible because the factors in the bi-factor model are orthogonal;
TESTFACT estimates abilities only for orthogonal factors. As a check, scores for the primary
factor were estimated along with the item parameters and compared to the estimates in the
second run after re-configuring the item parameter file; the score estimates were identical and the
estimated standard errors were nearly the same.

² To be accurate, CONQUEST (Wu, Adams, & Wilson, 1998) will estimate EAP and ML
scores for multidimensional IRT models, but only for models that are an extension of the Rasch
family of models, that is, models with constant discrimination parameters.

Table 1

Distribution of American History/Political Science Subscale Scores

Method                              Mean     SD     Minimum   Maximum
TESTFACT Bi-factor Model            0.00    0.90     -2.76      2.20
NOHARM Bi-factor Model             -0.01    0.88     -2.68      2.18
NOHARM 2-Factor Model               0.00    0.91     -2.78      2.21
Augmented Subscale Method           0.01    0.90     -2.71      2.03
Independent Unidimensional Model    0.01    0.87     -2.52      1.91


Table 2

Distribution of Global Issues Subscale Scores

Method                              Mean     SD     Minimum   Maximum
TESTFACT Bi-factor Model            0.00    0.87     -2.92      2.05
NOHARM Bi-factor Model             -0.01    0.86     -2.87      2.03
NOHARM 2-Factor Model               0.00    0.88     -2.91      2.04
Augmented Subscale Method           0.01    0.87     -2.78      1.84
Independent Unidimensional Model    0.01    0.82     -2.55      1.44

Table 3

Correlations among American History/Political Science Subscale Scores

                                    TESTFACT     NOHARM       NOHARM      Augmented
                                    Bi-factor    Bi-factor    2-Factor    Subscale
                                    Model        Model        Model       Method
NOHARM Bi-factor Model              1.000
NOHARM 2-Factor Model               0.995        0.995
Augmented Subscale Method           0.997        0.996        0.997
Independent Unidimensional Model    0.970        0.971        0.981       0.975


Table 4

Correlations among Global Issues Subscale Scores

                                    TESTFACT     NOHARM       NOHARM      Augmented
                                    Bi-factor    Bi-factor    2-Factor    Subscale
                                    Model        Model        Model       Method
NOHARM Bi-factor Model              0.999
NOHARM 2-Factor Model               0.991        0.989
Augmented Subscale Method           0.994        0.991        0.996
Independent Unidimensional Model    0.943        0.944        0.962       0.950

Table 5

Correlations among Course Grades and American History/Political Science Subscale Scores

                                    Course
Method                              History (N = 861)    Political Science (N = 319)
TESTFACT Bi-factor Model            .32                  .36
NOHARM Bi-factor Model              .32                  .36
NOHARM 2-Factor Model               .31                  .36
Augmented Subscale Method           .32                  .36
Independent Unidimensional Model    .29                  .35



Table 6

Correlations among Course Grades and Global Issues Subscale Scores

                                    Course
                                    anthropology   economics   geography   political sci.   sociology
Method                              (N = 293)      (N = 741)   (N = 315)   (N = 90)         (N = 221)
TESTFACT Bi-factor Model            .37            .29         .33         .66              .32
NOHARM Bi-factor Model              .37            .29         .33         .66              .31
NOHARM 2-Factor Model               .38            .30         .32         .66              .31
Augmented Subscale Method           .38            .30         .32         .67              .31
Independent Unidimensional Model    .35            .28         .30         .65              .24

Figure Captions

Figure 1. Bias of American History/Political Science Scores
Figure 2. Bias of Global Issues Scores
Figure 3. RMSE of American History/Political Science Scores
Figure 4. RMSE of Global Issues Scores




[Figures 1 through 4 are not reproduced here. Horizontal axis: Theta 1 (American History) in
Figures 1 and 3, Theta 2 (Global Issues) in Figures 2 and 4. Plotted methods: Augmented,
NOHARM 2-factor, TESTFACT Bi-factor, and unidimensional.]





